Sex classification from functional brain connectivity: Generalization to multiple datasets

Abstract Machine learning (ML) approaches are increasingly being applied to neuroimaging data. Studies in neuroscience typically have to rely on a limited set of training data which may impair the generalizability of ML models. However, it is still unclear which kind of training sample is best suited to optimize generalization performance. In the present study, we systematically investigated the generalization performance of sex classification models trained on the parcelwise connectivity profile of either single samples or compound samples of two different sizes. Generalization performance was quantified in terms of mean across‐sample classification accuracy and spatial consistency of accurately classifying parcels. Our results indicate that the generalization performance of parcelwise classifiers (pwCs) trained on single dataset samples is dependent on the specific test samples. Certain datasets seem to “match” in the sense that classifiers trained on a sample from one dataset achieved a high accuracy when tested on the respected other one and vice versa. The pwCs trained on the compound samples demonstrated overall highest generalization performance for all test samples, including one derived from a dataset not included in building the training samples. Thus, our results indicate that both a large sample size and a heterogeneous data composition of a training sample have a central role in achieving generalizable results.

ML models learn the feature-target relationship given a training sample.Subsequently, the model is applied to make predictions on previously unseen data (Dhamala et al., 2023) and successful generalization to independent data samples is the central goal in ML (Domingos, 2012;Varoquaux, 2018;Chung, 2018).For example, a recent study (Weis et al., 2020) demonstrated successful generalization of sex prediction models based on regionally specific functional brain connectivity patterns, which were trained on the data of the Human Connectome Project (HCP, Van Essen et al., 2012, Van Essen et al., 2013).For this spatially specific approach, independent classifiers were trained on the functional brain connectivity patterns of parcels covering the whole brain.In this case, assessing generalization performance should not only consider the averaged across-sample accuracy.Rather, if the classifiers generalize well, the same parcels should achieve high classification accuracies during cross-validation (CV) and across-sample testing.
Further sex classification studies (Menon & Krishnamurthy, 2019;Smith, Vidaurre, et al., 2013;Zhang et al., 2018), as well as other applications of ML models employed the HCP dataset to predict phenotypes such as task activation (Cohen et al., 2020), and individual behavioral and demographic scores (Cui & Gong, 2018;Smith et al., 2015) like age (Sanford et al., 2022).The HCP dataset is characterized by high-quality multi-modal imaging data acquired from a large group of healthy young adults.However, both the high quality of the brain imaging data as well as the narrow age range is not typical of other datasets, especially when dealing with clinical data (Arslan, 2018;Jansma et al., 2020;Rutten & Ramsey, 2010).This raises the question whether results based on the HCP data can be generalized to other datasets with different characteristics.Weis et al. (2020) demonstrated that sex classifiers trained on the HCP data generalized well to an independent subset of the HCP dataset as well as to the 1000Brains dataset (Caspers et al., 2014).Additional evidence from the application of such classifiers to data from datasets with diverse characteristics would provide even stronger evidence of model generalization.
Especially in neuroimaging, differences between datasets may result from several different sources.On the one hand, participants may differ with respect to demographic characteristics, such as age, education, or economic status.On the other hand, data samples likely differ with regard to the MRI acquisition parameters and data processing.Considering these differences, it is so far unresolved what kind of training sample leads to good generalization performance across multiple test samples.
Various characteristics of the training data can influence the generalization performance of ML models (Dhamala et al., 2023).For instance, larger sample size is beneficial for generalization performance (Cui & Gong, 2018;Domingos, 2012).Ensuring that the training data is representative of the target sample is another crucial factor for achieving good generalization performance (Ishida, 2019;Yang et al., 2020).Furthermore, data from different acquisition sites are likely heterogeneous with respect to demographic characteristics, data acquisition, and processing parameters.Due to the variability across different datasets and sites, a ML model trained on a compound of such data is more likely to capture the shared biological variability in all datasets while disregarding the variability resulting from differences between the datasets.This distinction supports models focusing solely on the biological variability independent of specific dataset characteristics.Hence, such models are less likely to overfit and more likely to generalize to new data.Thus, aggregating data from multiple sites should be beneficial for improving generalization performance.Indeed, this has been partially shown by studies concerning clinical applications of ML approaches (Chang et al., 2018;Nielsen et al., 2020;Willemink et al., 2020).These results suggest that training ML models on diverse datasets covering a wide range of characteristics may improve the overall generalization performance.
In the present study, we aimed to evaluate the generalization per- testing.We hypothesized that the pwC trained on the compound sample with a smaller sample size should outperform pwCs trained on single samples due to the heterogeneous data composition, while the compound sample with a higher sample size should achieve the overall best generalization performance (Chang et al., 2018;Cui & Gong, 2018;Dhamala et al., 2023;Domingos, 2012;Nielsen et al., 2020;Willemink et al., 2020).

| Data
We employed RS functional magnetic resonance imaging (fMRI) data of subsets of four large datasets to train and test sex classification models.For all datasets, we only included healthy subjects aged 20 years or older.Within each training sample, we matched females and males for age and included a similar number of women and men.
The first sample, taken from the HCP dataset (900 subjects data release; Van Essen et al., 2012;Van Essen, 2013), comprised 878 subjects with a mean age of 28.49 years (range: 22-37 years).The second sample, taken from the Brain Genomics Superstructure Project (GSP; Holmes et al., 2015) comprised 854 subjects with a mean age of 22.92 years (range: 21-35 years).The third sample was a subset from the Rockland Sample of the Enhanced Nathan Klein Institute (eNKI; Nooner et al., 2012), comprising 190 subjects with a mean age of 46.02 years (range: 20-83 years).The fourth sample, taken from the 1000Brains dataset (Caspers et al., 2014), comprised 1000 subjects with a mean age of 61.18 years (range: 21-85 years).This sample was included to examine generalization performance to an older sample.A fifth sample ("compound854") was constructed by combining subsamples of the HCP, GSP, eNKI and 1000Brains samples, with a mean age RS fMRI data from an additional dataset was included to evaluate classifiers on an additional independent sample.This sample comprised 370 subjects (214 females) with a mean age of 22.50 years (range 20-26 years) from the AOMIC dataset (Snoek et al., 2021).It was not additionally balanced for sex to maintain the maximum number of participants for evaluation.Data usage of the included datasets was approved by the Ethics Committee of the Medical Faculty of the Heinrich-Heine University Düsseldorf (4039, 5193, 2018-317-RetroDEuA).All data was collected in research projects approved by a local Review Board, for which all participants provided written informed consent.All experiments were performed in accordance with relevant guidelines and regulations.

| HCP
The RS fMRI data of the HCP dataset were acquired on a Siemens Skyra 3 T MRI scanner with multiband echo-planar imaging with a duration of 873 s and the following parameters: 72 slices; voxel size, 2 Â 2 Â 2 mm 3 ; field of view (FOV), 208 Â 180 mm 2 ; matrix, 104 Â 90; TR, 720 ms; TE, 33 ms; flip angle, 52 (https://www.humanconnectome.org/storage/app/media/documentation/s1200/HCP_S1200_Release_Reference_Manual.pdf).Participants were instructed to lie in the scanner with eyes open, with a "relaxed" fixation on a white cross on a dark background and think of nothing in particular, and to not fall asleep (Smith, Beckmann, et al., 2013).

| AOMIC
The AOMIC dataset includes two subsamples, PIOP1 and PIOP2, comprising data of healthy university students scanned on a Philips 3 T scanner.Participants were instructed to keep their gaze fixated on a fixation cross on the screen and let their thoughts run freely (Snoek et al., 2021).Both samples were acquired with a voxel size of 3 Â 3 Â 3 mm 3 and a matrix size of 80 Â 80.While PIOP1 was acquired for 360 s with multi-slice acceleration, 480 volumes and a 0.75 TR, PIOP2 was acquired for 480 s without multi-slice acceleration, 240 volumes and a 2 s TR (further details in https://www.nature.com/articles/s41597-021-00870-6/tables/10).

| HCP
RS data from the 'HCP S1200 Release' analyzed here was fully preprocessed and denoised via the Connectome Workbench software.In short, data were corrected for spatial distortions, head motion, B 0 distortions and were registered to the T1-weighted structural image (Smith, Beckmann, et al., 2013).Concatenating these transformations with the structural-to-MNI nonlinear warp field resulted in a single warp per time point, which was applied to the timeseries to achieve a single resampling in the 2 mm isotropic MNI space.Afterwards, global intensity normalization was applied and voxels that were not part of the brain were masked out.Locally noisy voxels as measured by the coefficient of variation were excluded and all the data were regularized with 2 mm Full width half maximum (FWHM) surface smoothing (Glasser et al., 2013;Smith, Beckmann, et al., 2013).The temporal preprocessing included corrections and removal of physiological and movement artifacts by an independent component analysis (ICA) of the FMRIB's X-noisifier (FIX, Salimi-Khorshidi et al., 2014).This method decomposes data into independent components and identifies noise components based on a variety of spatial and temporal features through pattern classification.

| GSP, eNKI, 1000Brains
RS data of the GSP, eNKI and 1000Brains samples were preprocessed in the same way.Initially, FSL was used for the removal of noise and motion artifacts by applying the FIX-denoising procedure (Jenkinson et al., 2012;Salimi-Khorshidi et al., 2014) using the appropriate pretrained dataset for noise classification.As FIX does not include normalization to MNI space, denoised data were further preprocessed with SPM12 (SPM12 v6685, Wellcome Centre for Human Neuroimaging, 2018; https://www.fil.ion.ucl.ac.uk/spm/software/spm12/) using Matlab R2014a (Mathworks, Natick, MA).For each subject, the first four echo-planar-imaging (EPI) volumes were discarded and the remaining ones were corrected for head movement by an affine registration with two steps: First, the images were aligned to the first image.Second, the images were aligned to the mean of all volumes.
The mean EPI image was spatially normalized to the MNI152 template (Holmes et al., 1998) using the "unified segmentation" approach (Ashburner & Friston, 2005) and the resulting deformation was applied to the FIX-denoised images and resampled to 2 mm 3 .

| Connectome extraction
Following the parcelwise approach by Weis et al. (2020), individual RS connectomes were extracted based on 400 cortical parcels of the Schaefer Atlas (Schaefer et al., 2018), and 36 subcortical parcels of the Brainnetome Atlas (Fan et al., 2016).Each parcel's time series was cleaned by excluding variance that could be explained by mean white matter and cerebrospinal fluid signal (Satterthwaite et al., 2013).Data was not further cleaned for motion related variance as this variance was already removed during FIX preprocessing.For each of the 436 parcels, the activation time series was computed as the mean of all voxel time courses within that parcel.Then, for each parcel, pairwise Pearson correlations were computed between the parcel's time series and those of all other 435 remaining parcels, representing the individual RS functional connectivity (RSFC) profile of the parcel.All models were built using support vector machine (SVM) classifiers.SVM is a supervised ML method that separates the data into distinct classes with the widest possible gap between these classes (Boser et al., 1992;Rafi & Shaikh, 2013;Vapnik, 1998;Zhang et al., 2021).Based on its operational principles regarding a supervised binary classification task and successful applications in previous sex classification studies (Flint et al., 2020;Weis et al., 2020;Wiersch et al., 2023), SVM is a suitable method for the present task.SVM models were built in Julearn (Hamdan et al., 2023;  22-80) sample.Here, for computing time reasons, we restricted the choice of the SVM kernel to rbf (see Weis et al., 2020).Finally, generalization performance of all six pwCs was assessed on the AOMIC sample.All reported accuracies are balanced accuracies.

| Across-sample classification accuracy
To statistically compare the classification accuracies of pwCs across the different test samples, we employed independent t-tests between the different across-sample accuracies over the respectively 10% highest classifying parcels.Additional analyses using all 436 parcels are reported in the supplements (Table S3 and below

| Consistency of highly classifying brain regions
Previous studies have demonstrated that sex classification accuracies for models trained on parcelwise RSFC patterns do not achieve uniformly high performance across the whole brain (Weis et al., 2020;Zhang et al., 2018).Thus, we assessed generalization performance of the different pwCs by examining the consistency of highly classifying brain regions during CV and across-sample testing.Consistency was assessed by computing Dice coefficients (DSC) to evaluate the similarity in spatial distribution of parcels achieving certain accuracies in both CV and across-sample testing.This consistency was evaluated for different accuracy thresholds above chance (0.5-0.7 at 0.02 steps).For each threshold, Dice coefficients were computed as the number of common parcels achieving within-and across-sample accuracies above or equal to that threshold (p_com) multiplied by 2 and divided by the total number of parcels achieving a within (p_tr) or across-sample (p_te) accuracy above or equal to that accuracy level in CV (Dice, 1945;Sorensen, 1948).
To facilitate comparison of the dice score distributions between the different pwCs and test samples, we summarized each contribution into one score by computing a weighted mean (wmDice) as the average of each dice coefficient weighted by the accuracy threshold for which the respective dice coefficient was calculated.

| RESULTS
The generalization performance of pwCs trained on each of the single dataset samples (HCP, GSP, eNKI, & 1000Brains) and on both compound samples were compared with respect to mean acrosssample accuracy averaged across the best 10% classifying parcels.
Additionally, we evaluated the consistency of the spatial distribution of accurately classifying parcels between CV and acrosssample testing to determine whether pwCs trained on compound samples exhibit more generalizable results in contrast to pwCs trained on single samples.

| Training and test classification accuracies
For the single samples pwCs, the mean within-sample performance across the top 10% classifying parcels was at a similar level for pwC GSP (66.8%), pwC eNKI (66.9%) and pwC 1000Brains (66.3%) and ranged up to 73.5% for pwC HCP.The mean across-sample accuracies averaged for the top 10% classifying parcels ranged between 58.4% (for pwC HCP tested on AOMIC and pwC eNKI tested on 1000Brains) and 65.8% (for pwC GSP tested on eNKI).Details for within-and across-sample performance are reported in Table S1 and Figure 1 and Accuracy maps for the different combinations of training and test samples were compared using independent t-tests across the top 10% classifying parcels in each prediction (details in Table S2).
First, we analyzed differences in classification accuracies between test samples for each pwC (horizontal comparisons, Figure 1): For pwC HCP, testing on 1000Brains achieved the highest mean classification accuracy (59.8%).The averaged accuracy for this test sample was descriptively higher than for the GSP and significantly higher than for the eNKI and AOMIC test samples.PwC GSP achieved significantly higher accuracies for the eNKI test sample (65.8%) than for any other test sample, while pwC eNKI showed highest accuracies for the GSP test sample (60.7%).This across-sample prediction showed descriptively higher accuracies than pwC eNKI did for the HCP test sample and significantly higher accuracies than for the AOMIC and 1000Brains samples.For pwC 1000Brains, testing on the HCP showed significantly higher accuracies (64.8%) than testing on the eNKI, GSP and AOMIC sample.Details of all statistical comparisons are given in Table S2.
PwC compound854 achieved a mean within-sample accuracy of 65.3% for the top 10% classifying parcels, while mean across-sample accuracies of the highest classifying parcels ranged between 62.4% (pwC compound854 tested on AOMIC) and 71.8% (pwC com-pound854 tested on eNKI, further details in Table S1 and Figure 2, Figure S2).PwC compound2190 achieved a mean within-sample accuracy of 67.9% within the top 10% classifying parcels.The mean across-sample accuracies averaged across the top 10% classifying parcels ranged between 65.5% (pwC compound2190 tested on AOMIC) and 74.6% (pwC compound2190 tested on eNKI, details in Table S1 and Figure 2, Figure S2).
Contrasting the top 10% classifying parcels in the accuracy maps of pwC compound854 and pwC compound2190 displayed peaks in accuracies for the eNKI test sample (71.8% and 74.6%) resulting in significantly higher accuracies than for the remaining test samples, respectively (Figure 2 and Table S2).We also contrasted how the six pwCs performed on each test sample by employing independent t-tests: pwC compound 854 outperformed all pwCs trained on single samples for all test samples, except for the AOMIC test sample, where pwC GSP achieved higher accuracies within the best 10% classifying parcels (Table S2).PwC compound 2190 outperformed all other pwCs for the HCP, GSP, eNKI and AOMIC test sample with regards to the top 10% classifying parcels in each across-sample prediction (Figure 2).Details for all statistical comparisons are shown in Table S2.

| Consistency of correctly classifying parcels
To evaluate the spatial consistency of accurately classifying parcels, we calculated the dice coefficient between thresholded within-and across-sample accuracy maps at different levels of accuracy.Here, a high dice coefficient indicates a high overlap in highly classifying parcels between within and across-sample predictions at a given accuracy level.The results are depicted in the blue bar plots in Figure 3.
Regarding spatial consistency within a given pwC (horizontal comparison in Figure 3 pwCs for the HCP, GSP and 1000Brains test samples and pwC com-pound2190 demonstrated higher spatial consistency than the other six pwCs.Dice coefficients for the top 10% classifying parcels are reported in Figure S3.

| DISCUSSION
In the present study, we examined the generalization performance of parcelwise sex classification models trained on different samples.
Here, we operationalized generalization performance in terms of both mean classification accuracy of best classifying parcels during acrosssample testing as well as spatial consistency in highly classifying parcels between CV and across-sample testing.Since not all parcels are expected to achieve high classification accuracies (Weis et al., 2020;Zhang et al., 2018), we mainly focused on the top 10% shows the dice coefficients (left y-axis), representing the overlap of accuracy maps between cross-validation (CV) and test predictions at different accuracy levels (x-axis).For each accuracy-threshold, the respective dice coefficient was calculated as the number of similar parcels classifying above a certain accuracy-threshold in both, respective CV and test prediction, in relation to the total number of parcels of both predictions classifying at this level.For each combination of pwC and test sample, the weighted mean of the dice coefficients (wmDice) across accuracy levels is displayed above the subplot to allow for a straightforward comparison between the distributions of dice coefficients.
of the datasets achieved a high accuracy when tested on the respective other one and vice versa.This was the case for HCP and 1000Brains as well as for GSP and eNKI with the former matching the results of a previous study (Weis et al., 2020).Based on the good across-sample performance of sex classifiers trained on an HCP sample on a subsample of the 1000Brains, Weis et al. (2020) suggested that parcelwise sex classification generalizes well between different samples.No additional samples from other datasets were considered in Weis et al. (2020).The present results extend the findings of the previous study by showing that good generalization performance of the HCP classifiers appears to be specific to the 1000Brains sample.
Generalization to samples from other datasets (GSP, eNKI and AOMIC) is, however, rather poor.Thus, our study demonstrates that the generalizability of pwCs trained on single dataset samples depends on the train-test data combination, which is in line with a previous study that employed sex classification based on regional homogeneity of RS time series (Huf et al., 2014).The limited generalization performance of the pwCs trained on single dataset samples to the majority of test samples from other datasets might be attributed to the homogeneity of each single dataset training sample arising due to demographic factors such as the age range (Damoiseaux, 2017;Damoiseaux et al., 2008;Scheinost et al., 2015) as well as technical details such as fMRI acquisition parameters (Brown et al., 2011;Yu et al., 2018).Homogeneous data characteristics within each dataset will result in a homogeneity of the feature space on which ML models are trained.Such homogeneous features might lead the ML model to learn dataset-specific characteristics that are predictive of the target variable, which might not translate to other test samples, resulting in inaccurate across-sample predictions (Huf et al., 2014).Thus, training ML models on a single, homogenous sample may not be ideal to achieve a good generalization performance on diverse test samples (Belur Nagaraj et al., 2020;Di Tanna et al., 2020;Huf et al., 2014;Janssen et al., 2018).In contrast, training classifiers on a combination of multiple datasets (pwC compound854 and pwC compound2190) achieved significantly higher accuracies for all test samples, including the sample from a dataset which was not included in the compound training sample.We contrasted performances of both pwCs trained on compound samples to evaluate potential sample size effects.Here, pwC compound854 demonstrated higher accuracies and spatial consistency in the majority of across-sample predictions compared to single sample pwCs, but did not outperform pwC compound2190.These results suggest that the sample size of the training sample is an important factor in determining the generalization performance of ML analyses.These results align with the findings of several other studies highlighting the importance of the sample size in ensuring accurate ML results (Cui & Gong, 2018;Dhamala et al., 2023;Domingos, 2012;Ishida, 2019;Yang et al., 2020).However, pwC compound854 still predominantly demonstrated a higher generalization performance compared to single sample pwCs with a similar or even higher sample size.Thus, it is evident that the composition of a training sample is crucial in ensuring generalizable ML results, as reported by previous studies (Chang et al., 2018;Huf et al., 2014;Willemink et al., 2020).
While a high sample size is beneficial to assure reliable and accurate ML predictions (Dhamala et al., 2023), the heterogeneity and representativeness of a composite sample led to significantly better results than single sample pwCs with a higher sample size in the present ML analyses.Thus, the high generalization performance of both compound samples is not only attributable to the sample size but also to the heterogeneity of data characteristics included in a training sample created from various datasets.This heterogeneity likely enables the model to learn patterns that do not rely on specific sample characteristics, but actually capture the underlying relationship between features and target, enabling the model to generalize better, even to data from datasets that were not included in training.Therefore, the heterogeneity of a composite training sample is essential for generalizable ML outcomes and may also serve to minimize sample-specific biases (Li et al., 2022).Thus, training on a compound sample comprising the variability of multiple datasets is preferable to training on single dataset samples in order to achieve high generalization performances (Chang et al., 2018;Huf et al., 2014;Willemink et al., 2020).
While undesirable sources of variability, e.g.due to scanner differences, may be accounted for by using data harmonization (Fortin et al., 2017;Yu et al., 2018), in the present study we intentionally refrained from using harmonization techniques.Here, we evaluated the generalization performances of differently trained pwCs in order to determine which may generalize best to unseen data.Harmonization techniques such as ComBat are not suitable for this purpose because they require a sufficient amount of data from each sample and site (Orlhac et al., 2022).
The parcelwise classification approach allowed us to investigate generalization performance not only in terms of accuracy but also  (Chang et al., 2018;Huf et al., 2014;Nielsen et al., 2020;Willemink et al., 2020).
Overall, the aggregation of multiple samples in pwC com- sample is more than twice as large as compared to any of the single dataset samples.Such high sample size has been shown to be beneficial for generalization (Cui & Gong, 2018;Domingos, 2012;Ishida, 2019;Yang et al., 2020).However, sample size alone is likely not sufficient to explain the high generalization performance.For classifying at a high level for the eNKI test data (up to 83%), resulting in such a high mean accuracy for the top 10% parcels (Figure 2).Furthermore, the mean accuracy averaged across all 436 parcels confirms that there are only few parcels responsible for the high accuracy in the top 10% parcels, as the eNKI dataset did not exhibit the overall highest mean accuracy across all parcels.
In contrast to both compound pwCs, CV and across sample test performances differed considerably for pwCs trained on single dataset samples.This lack of generalization performance was especially apparent for pwC HCP which showed a rather high performance during CV in combination with the lowest generalization performance both with respect to accuracy and spatial consistency.While homogeneity of a data sample has been argued to lead to high CV classification accuracy (Huf et al., 2014), sample characteristics such as the age range were comparable between HCP and the GSP sample, with the latter outperforming HCP in generalization performance.Thus, the comparably poor performance of classifiers trained on the HCP sample may be partially attributed to sample homogeneity but also to other factors such as the differences in preprocessing pipelines.For the HCP sample, connectome extraction was based on the FIX denoised preprocessed version of the data.The eNKI, GSP and 1000Brains samples were preprocessed using the same pipeline in FSL/SPM12 also including FIX-denoising, while the AOMIC sample was preprocessed using fMRIprep without FIX.Given that comparative performance evaluation of fMRI data is sensitive to preprocessing decisions (Bhagwat et al., 2021), it is likely that this difference in preprocessing may contribute to the poor generalization performance of pwC HCP when tested on the other single samples.Furthermore, the high withinsample accuracy coupled with the lack of generalization performance may also indicate an overfitting effect of pwC HCP during training (Cui & Gong, 2018;Domingos, 2012).
The present study, however, does not primarily aim to build a classifier attaining highest sex classification accuracies but rather to evaluate the impact of the training sample in ML models, particularly the size and composition of the training sample.
Altogether, our results highlight the importance of the sample size and also a heterogeneous, diverse, and representative data composition for training ML models (Cui & Gong, 2018;Dhamala et al., 2023;Domingos, 2012;Gong et al., 2019;Li et al., 2022), which can be achieved by combining data from multiple sites and datasets (Chang et al., 2018;Nielsen et al., 2020;Willemink et al., 2020).By minimizing sample-specific biases, we can aim for maximizing the generalizability of ML models.

| Limitations
The present results consistently demonstrated the superior generalizability of sex classifiers trained on compound samples as compared to those trained on single dataset samples, but they come with some limitations.First of all, the high spatial consistency of pwC com-pound2190 might partially be attributed to the generally higher accuracy of the across-sample predictions.Dice coefficients across the top 10% classifying parcels showed a more differentiated pattern.
Here, pwC compound2190 did not always outperform pwCs trained on single samples.Overall, the predominantly higher generalization performance of pwC compound2190 can be attributed to the sample size and sample composition of its training sample.However, an additional systematic study would be required to determine the exact degree to which each factor contributes to high generalization performance.
Another limitation in the present study is that, while we accounted for age as a potential confound during training of the classifiers, there might be other confounds that were not considered.For example, we did not control for structural variables such as brain size, which have been reported to influence brain functions (Batista-Garcia-Ramo & Fernandez-Verdecia, 2018) and RS brain connectivity in particular (Zhang et al., 2018).Thus, in principle, different distributions of brain size within the different samples might have influenced the present results.However, Weis et al. ( 2020) demonstrated that at least with their training sample, classification based on RS connectivity was not systematically influenced by brain size.Still, there might be other demographic variables which differ between samples and might influence classification accuracies (Li et al., 2022;Mehrabi et al., 2021;Sripada et al., 2021).
A further limitation of the present study is the potential impact of different preprocessing approaches which may affect the outcomes in ML analyses.In neuroimaging data, there can be various sources of noise and artifacts.Prior to data analysis, it is necessary to preprocess the data to mitigate these issues and enhance the data quality.However, the impact of preprocessing steps on the outcomes of fMRI analyses has been well documented.For instance, conceptually similar preprocessing packages such as AFNI, FSL, or SPM can produce differences in fMRI results (Bowring et al., 2019).Differences on the level of preprocessing steps may also produce dissimilarities (Carp, 2012).Even differences in the order of preprocessing steps can lead to differences in the graph theoretical outcomes derived from RS functional connectivity (Gargouri et al., 2018).Thus, it is plausible that discrepancies in preprocessing pipelines may lead to differences in classification outcomes.Indeed, one study that compared ML results for patient and healthy control classification across different preprocessing pipelines indicated differences in the classification accuracy (Vergara et al., 2017).Overall, while different preprocessing approaches may lead to differences in the fMRI and ML results, in the present study these differences represent an additional source of variance that may occur when using data of various datasets.Another factor which has not been considered in the present analyses are fluctuating sex hormones, which have been shown to influence functional brain connectivity in RS (Arélin et al., 2015;Haraguchi et al., 2021;Weis et al., 2019).These dynamic changes in female and male connectivity patterns (Coenjaerts et al., 2023;Kogler et al., 2016;McEwen & Milner, 2017) will likely influence overall sex classification accuracies.However, unfortunately, most publicly available datasets do not provide information on hormone levels, making it impossible to consider these variations in the analyses.Future largescale studies should include hormone levels in data acquisition, enabling model training on a combination of multiple independent datasets with well characterized phenotypes to achieve most accurate results.

| CONCLUSION
The present results show that parcelwise sex classification models generalize best when trained on a compound sample including data with different demographic and data acquisition characteristics.Our results demonstrate that a large and heterogeneous training sample including multiple datasets is best suited to achieve accurate generalization performance.This observation carries practical implications for future neuroimaging studies employing ML models for generalizable predictions.
formance of multiple sets of sex classification models derived from different training samples.The different training samples were created from four different datasets with varying demographic characteristics.In addition, sex classifiers were trained on compound samples combining data from all datasets to obtain training samples with heterogeneous sample characteristics.Both compound samples comprise the same ratios of datasets, sex and age distributions, but differ in sample size to additionally assess the effect of training sample size.Following the parcelwise approach by Weis et al. (2020), we trained independent sex classifiers on the resting state (RS) connectivity patterns of 436 parcels covering the whole brain.For each parcel, a sex classification model was built based on the individual connectivity profile, resulting in one classification accuracy value per parcel.This was done for each of the six training samples, resulting in six sets of parcelwise classifiers (pwCs).These pwCs were applied to test samples from the four original datasets and one dataset which was not part of the training samples.Then, accuracy maps, representing the spatial distribution of classification accuracies for each parcel were generated for CV (within-sample accuracy) and for application of the pwCs to the different test samples (across-sample accuracy).The comparison of these accuracy maps enabled us to evaluate generalization performance of classifiers by (i) examining the mean accuracy of all pwCs across the 10% best classifying parcels and (ii) comparing the spatial location of highly classifying parcels between CV and across-sample test.Good generalization performance with regard to spatial consistency is characterized by identical parcels performing well in CV and across-sample of 40.05 (range: 20-85), resulting in a sample size of 854 subjects.This sample size is equal to the GSP sample, but larger than the eNKI and lower than the HCP and 1000Brains samples, therefore representing an intermediate sample size compared to the other data samples.Another sixth sample ("compound2190") was constructed by combining 75% of the HCP, GSP, eNKI and 1000Brains samples resulting in a sample size of 2190 subjects in total.The compound 2190 sample comprised a mean age of 40.10 years (range: 20-85 years).Thus, both compound samples display a large difference in sample size but ratios of dataset representation, sex and age distribution have been maintained.This allows us to evaluate the influence of data composition compared to the sample size of a training sample on the generalization performance of sex classification models.
Sex classification models were trained based on the individual multivariate RSFC profile of each parcel.Specifically, the connectivity values between each parcel and the 435 remaining parcels were used as features to train a sex classification model per parcel, resulting in a set of 436 pwC(Weis et al., 2020).Since each model provides one final accuracy value, one pwC provides an accuracy map covering the entire brain.Training sex classification models based on the connectivity profile of each parcel allows for a reduction of the feature dimensionality for each model (1 Â 436) as compared to training one model based on the overall connectivity profile (436 Â 436).Furthermore, using this parcelwise approach allows us to identify the highest classifying brain regions.In the following steps, we evaluated generalization performance in terms of classification accuracies and spatial consistency of highly classifying parcels across the entire brain.
). Significance levels were Bonferroni-corrected according to the number of dependent tests (15 dependent tests for comparing acrosssample accuracies of all six pwCs on the AOMIC test sample, 10 dependent tests for comparing the across-sample accuracy of both compound pwCs for the five test samples and for comparing pwC performances against each other for each of the five test samples; six dependent tests for all other comparisons).
Figure S1.Parcelwise within-and across-sample accuracies are displayed as accuracy maps in Figure 1a and the distribution of test accuracies is shown in Figure 3 (red boxplots).Here, accuracy maps represent the spatial distribution of classification accuracies resulting from the 436 individual ML models trained on the respective multivariate RSFC profile of each parcel.WIERSCH ET AL.

F
I G U R E 1 Accuracy maps and tile plots of mean accuracies of top 10% classifying parcels for parcelwise classifiers (pwCs) trained on single samples.(a) Spatial distribution of parcelwise sex classification accuracies across the brain.Within-sample accuracies are depicted on and acrosssample accuracies off the diagonal.Only parcels with an accuracy of 0.5 or higher are displayed.(b) Mean accuracies averaged across the top 10% classifying parcels for each cross-validation (CV) and across-sample prediction.
), pwC HCP overall demonstrated relatively low spatial consistency while it was highest for 1000Brains (wmDice = 0.1765, all other wmDice <0.1112).Spatial consistency for pwC GSP was highest for the eNKI sample (wm = 0.3103) and lowest for 1000Brains (wmDice = 0.1810) with spatial consistency for HCP (wmDice = 0.2407) and AOMIC (wmDice = 0.2607) test samples ranging in between.PwC eNKI showed overall low spatial consistency for the HCP, 1000Brains and AOMIC sample (wmDice: 0.1244-0.1523)and highest for the GSP sample (wmDice = 0.2072).Spatial consistency of pwC 1000Brains was lower for the GSP, eNKI and AOMIC test sample (wmDice: 0.1201-0.1853)but considerably higher for the HCP test sample (wmDice = 0.3159).Spatial consistency of pwC compound854 ranged between 0.2865-0.3221for the HCP, GSP, eNKI and 1000Brains sample and achieved 0.2546 for the AOMIC sample.PwC compound2190 demonstrated a relatively similar spatial consistency for HCP, GSP, eNKI and 1000Brains (wmDice: 0.3641-0.4168)and lower spatial consistency with the AOMIC sample (wmDice = 0.2960).Concerning the comparisons within each test sample (vertical comparisons in Figure 3) pwC com-pound854 demonstrated higher spatial consistency than single sample F I G U R E 2 Accuracy maps and tile plots of mean accuracies of top 10% classifying parcels for parcelwise classifiers (pwC) compound 854 and pwC compound 2190.(a) Spatial distribution of parcelwise sex classification accuracies across the brain.Only parcels with an accuracy of 0.5 or higher are displayed.(b) Mean accuracies averaged across the top 10% classifying parcels for the respective cross-validation (CV)-(first column) and across-sample predictions.
classifying parcels.Overall, our results showed that classifiers trained on single dataset samples generalized well only for certain test samples.In contrast, classifiers trained on the compound samples tend to outperform classifiers trained on single dataset samples both in terms of accuracy and consistency of accurately classifying parcels.To evaluate generalization performance with respect to mean classification accuracies of the top 10% classifying parcels, for each pwC, we compared across-sample classification accuracies between the different test samples.Results indicate that certain datasets seem to "match" in the sense that classifiers trained on a sample from one F I G U R E 3 Spatial consistency of all parcelwise classifiers (pwCs).For each combination of training (rows) and test sample (columns), the right side of each subplot (red boxplot) depicts the distribution of accuracies across all parcels (right y-axis).The left side of each subplot (blue barplot) with respect to the spatial distribution of accurately classifying parcels.To quantify the overlap of accurately classifying parcels between CV and across-sample testing, we computed dice coefficients between within-and across sample accuracy maps at different accuracy thresholds.We observed a pattern similar to the one found for classification accuracies, with the train-test pairing of HCP and 1000Brains and GSP and eNKI, respectively, showing highest spatial consistency, relative to other combinations.Thus, also when considering spatial consistency, generalization performance depended on the specific pairing of training and test datasets.For pwCs trained on single samples, training sample characteristics appeared to be the most important factor in driving generalization performance across test samples.In contrast, pwC compound854 achieved superior spatial consistency in most test samples and pwC compound2190 in all test samples, as compared to pwCs trained on single samples.Thus, the classifiers trained on the compound samples achieved both higher classification accuracies as well as more consistency in accurately classifying parcels as opposed to the classifiers trained on single dataset samples.Altogether, the high generalization performance for pwC compound854 and pwC compound2190 can likely be attributed to the data heterogeneity in the respective training samples which was achieved by combining multiple samples for training.These findings match results of previous studies pound854 and pwC compound2190 for training sex classifiers resulted in superior generalization performance compared to pwCs trained on single samples.Firstly, the classification accuracies were comparable between CV and the different across-sample test classifications.Secondly, highly classifying parcels overlapped to a large degree between training and test.The overall high generalization performance of pwC compound2190 across all test samples could be attributed to several possible explanations: first, the compound2190 instance, the eNKI sample consists of only 190 participants, but the classifiers trained on this sample achieved better generalization performance than those trained on the HCP sample, which included 878 participants.In addition, analyses with pwC compound854 also demonstrated a superior generalization performance with respect to classification accuracies as well as spatial consistency compared to single sample pwCs, despite the smaller sample size.A second explanation for the good performance of both compound pwCs may lie in the heterogeneous nature of its training sample as discussed above.Having the different samples represented within the compound sample may have allowed the classifiers to classify sex based on sample-unspecific information.Another potential explanation is that the training samples of pwC compound854 and pwC com-pound2190 partially consist of data from datasets on which we evaluated the test performance.In general, training on data that is representative of the test data typically results in an increased generalization performance (Chung et al., 2018).Here, both training samples for the compound pwCs composed data from four different datasets.Although each dataset had a different sample size and thus a different share in the respective compound training sample, the model applications to the eNKI test sample showed highest accuracies for the best 10% classifying parcels.This result stems from few parcels Despite various potential sources of variance within the training samples of the compound pwCs, pwC compound854 and pwC compound2190 demonstrate a comparatively good performance compared to the single sample pwCs.While it is reasonable to anticipate that aligned preprocessing approaches may improve predictions; however, conducting a systematic evaluation on the effect of preprocessing pipelines is beyond the scope of the present study and remains an important open question for future research.
The parameter search included choice of kernel (linear vs. radial basis function (rbf) kernel) as well as the hyperparameters gamma and C, which is used to set the strength of regularization (https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html).The SVM algorithm used in the present study incorporates a squared L2 regularization.The regularization parameter controls the trade-off between the model fit to the training data and generalizable predictions beyond the training data in order to avoid overfitting and to optimize model performance and generalizability (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).Confounding effects of age were regressed out in a CV-consistent manner by removing age-related variance before training the classifiers.By estimating confound regression models only for training subsets and applying them to training and test sets, leakage of information from test to training data within the CV-process can be avoided (More et al., 2021).The best performing combination of hyperparameters was used for the final model for each individual parcel.Within-sample classification accuracy for each individual parcel was determined by averaging accuracies over CV folds and repetitions.For a cross-sample classification, single dataset pwCs were tested on the respective other three samples, while pwC compound 854 and pwC compound 2190 were tested on the remaining 25% of the HCP (n = 220, mean age: 29.68, age range: 22-36), GSP (n = 214, mean age: 22-72, age range: 21-31), eNKI (n = 48, mean age: 47.52, age range: 20-75) and 1000Brains (n = 250, mean age: 52.08, age range: https://juaml.github.io/julearn/main/index.html) including a hyperparameter search nested within a 10-fold CV with five repetitions.